Introduction

Our Question

As students of economics, we have studied many industries and market structures to understand how firms try to maximize profit. In many of the models from our prior classes, pricing was straightforward, and we had clear evidence showing why a producer would sell at a particular quantity and price. For airlines selling flights, however, how firms maximize profit is more ambiguous. Airlines compete in an oligopoly, so they would like to collude but cannot, which raises the question: what is their next-best alternative? What methods can they employ to gain an edge over competitors and maximize profit when their best option is unavailable? Anyone who has tried to buy an airline ticket has probably noticed that not every ticket is priced the same and that some people get better deals than others. We know that one method airlines use to maximize profit is price discrimination, but that alone tells us nothing about their revenue relative to their costs. In other words, knowing that a business traveler will typically pay more for a ticket does not tell us much about how much the airline will make overall. Are airlines getting the most they can out of flights, and what incentives do they have to offer deals and discounts? These are important questions if we want to learn more about the industry.

We also want to know what kinds of deals airlines offer to increase profit. For example, we will look at whether airlines offer bulk discounts to groups buying tickets in large quantities, and whether they offer slightly better deals on round-trip flights so they can earn revenue on both legs of a journey. These are standard practices in some other industries, and we wanted to see whether they apply to this industry as well. If they do, that might give us as consumers new ideas on how to take advantage of the deals airlines offer.

To state the plan explicitly, we take airline data and use flight characteristics to build a regression that estimates the revenue airlines can make from a given itinerary. The main variable we try to predict is itinerary yield, which is the fare per mile flown and so represents the revenue an itinerary brings in per mile. We look at how this revenue per mile is affected by distance, which gives us an idea of whether airlines offer discounted rates to those flying long distances. Our hope is that our regression will answer these questions and provide a stable model for future research, or at the very least a rough basis to be polished and refined at a later date.

Original Study

Below is a visual that shows the type of study we originally wanted to conduct. The data behind it is not open source, and because we could not find comparable data, we changed both the data and the remainder of the project.

The visual gives a rough answer to our original questions: “When is the cheapest time to buy airline tickets? Does pricing change significantly as demand for flights varies? Do airlines vary price to combat strategic consumers?” Yes: ticket prices vary considerably over time.

Cheapair.com

Data Source: Source


Literature Review

In our search to answer the question “What factors affect airline price discrimination?”, we first need to identify variables that have significant effects on the price of airline tickets. To do this, we reviewed research from others who have examined individual factors and price determinants. We use their findings as a basis for what data should be obtained and applied to our regression study of how airline ticket prices vary with these previously examined factors, along with any others we find. The primary factors we examine are the scarcity of seats, the impact of oil prices, and consumer strategies.

Much of the research is based on the interactions between consumers and airline pricing. To begin, we felt it was necessary to find evidence that consumers in the airline ticket market are strategic, as is commonly assumed. That is, are the assumed interactions between price and consumers reasonable, or are they the result of other hidden factors? Li and Netessine (2014) analyze this base question. Their research asks primarily whether consumers in the airline industry are strategic to a significant degree, and secondarily what effects this may have on airline revenues. Thanks to their work, we can say with relative confidence that strategic consumers do exist in this market, with an estimated share of 5.2% to 19.2% of consumers [1]. The effects of their presence on total revenue are more complicated. The authors found that nondecreasing price commitment strategies can reduce the level of strategic consumers, although these same strategies also reduce the level of indifferent consumers [1] (those who buy tickets simply because they are cheap, rather than waiting for prices to drop before buying). These findings suggest that the effect of strategic consumers on revenue is a relationship worth investigating: if understood, it would allow for more exact pricing strategies that maximize revenue by shifting market strategies based on flight patterns, trends, and consumer inputs.

This leaves the question: what timing is most strategic when selling tickets? To answer this, two things must be identified. The first is which market (business or tourist) the airline is primarily selling to. The second is which of three periods the passengers purchase in. The ideal model appears to sell a substantial portion of seats in the initial period to business travelers, who are less likely to cancel, and then open ticket purchases to the tourist market. Seating capacity is of course limited by the plane. When capacity is low, airlines are typically “better off selling exclusively to business consumers, who have higher valuations and thus will pay more.” [4]

The final period of sales concerns last-minute deals, i.e., the idea that seats that would otherwise go unoccupied can be filled at a lower price than normal. This practice cuts into the profit the airline could otherwise realize, but it usually recovers part of the cost of a seat that would otherwise fly empty. This tool is usually used only when capacity is high and only on the day of the flight.

So, in answer to the question of what timing is most strategic when selling tickets, we found that airlines should likely price discriminate in the first purchasing period, especially toward the tourist market. In the second and third purchasing periods they should market moderately priced tickets to the business segment, and finally, on the day of the flight, they should use the most profit-preserving option, i.e., “last minute deals” for unfilled seats.

This pricing strategy, however, changes as flight frequencies rise and fall. Cattaneo, Malighetti, Redondi, and Salanti found that fare variation is negatively correlated with changes in the frequency of a flight. Put simply, frequent flights reduce an airline’s ability to price discriminate.

This leads to pricing strategies and consumer responses to them. Gregor Bischoff, Sven Maertens, and Wolfgang Grimme found that airlines will often charge more for a one-way ticket than for a round-trip ticket, causing consumers to buy a round-trip ticket but skip the return leg to save money. As one could imagine, carriers are not fond of this practice and do their best to curb it while maintaining the same price-discriminatory practices. These authors also note the history of air travel and how demand has settled into a seasonal pattern in which most people travel for holidays. This leads into an analysis of basic consumer elasticities: a consumer purchasing a last-minute round-trip flight is typically very inelastic, whereas a consumer buying tickets months in advance for a vacation flight is more elastic and responsive to price. The authors then explain some price discrimination methods, such as offering discounts to flyers who book well in advance and offering last-minute customers a discount for booking their return flight with the same airline.

Additionally, airlines price discriminate by the day of the week on which a ticket is purchased. Steven Puller and Lisa Taylor found ticket prices to be lower on weekends than on weekdays. They concluded that this is because people buying on weekends are more often buying for leisure and are therefore more price-elastic, i.e., more sensitive to price changes.

Our goal in this paper is to find other factors that airlines use to price discriminate so that one can predict the best possible time to buy any given flight. One thing we did not find is whether prices rise within the few minutes after one begins looking at a flight.

One of the major variables not accounted for in their estimates is firm-to-firm competition. This is likely to have significant effects on pricing (through price elasticity) due to the oligopolistic nature of the market.

This article talks extensively about the radical shifts made by certain airlines to make seats more about supply and demand than class and comfort. This shift to a lower-cost flight model has affected the way that higher-cost airlines do business and has prompted them to reexamine how they control prices.


Data Analysis

All data was obtained via the US Federal Aviation Database systems. Source

Because of the size of the data, this document is created with code that randomly samples 20,000 points from our 11 million. As a result, the charts below vary from one run to the next, so we refrained from interpreting them. The tabular outputs, by contrast, are based on the full data set. When reporting regression outputs, we coded in references to the outputs so that the reported values update with each sample while the interpretations stay fixed. The patterns and relationships observed remain constant across samples because of the strength and significance of our variables and the large sample size.
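The sampling step described above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: `full_dat` is a made-up stand-in for the full data set, and fixing the seed is an addition here (the report does not say a seed was set) that would make the sampled charts repeatable.

```r
# Hypothetical stand-in for the full filtered data set (the real one has ~11 million rows).
set.seed(42)  # assumption: a fixed seed, added so the same rows are drawn every run
full_dat <- data.frame(
  ITIN_YIELD = runif(100000, min = 0.05, max = 2),
  PASSENGERS = sample(1:20, 100000, replace = TRUE)
)

# Draw a simple random sample of rows without replacement.
n_samp <- 20000
samp_demo <- full_dat[sample(nrow(full_dat), size = n_samp), ]

nrow(samp_demo)
```

Without a fixed seed, each knit of the document draws a different sample, which is why the charts and sampled regression outputs shift slightly between runs.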

Variable Definition

| Variable Name | Description |
|---------------|-------------|
| QUARTER | Quarter (1-4) |
| ROUNDTRIP | Round Trip Indicator (1 = Yes) |
| ITIN_YIELD | Itinerary Fare Per Mile Flown, in Dollars (ITIN_FARE / MILES_FLOWN) |
| PASSENGERS | Number of Passengers |
| ITIN_FARE | Itinerary Fare Per Person |
| DISTANCE_GROUP | Distance Group, in 500 Mile Intervals |
| MILES_FLOWN | Itinerary Miles Flown (Track Miles) |
| ITIN_GEO_TYPE | Itinerary Geography Type: 0 = Contiguous Domestic (Lower 48 U.S. States Only), 1 = Non-contiguous Domestic (Includes Hawaii, Alaska and Territories) |
tabl3 <-"
| Transformed Variable Name | Original Variable Name | Description  | 
|--------------------|:---------------:|:--------------------------:|
| lPASSENGERS        | PASSENGERS    |  Log(PASSENGERS)   |
| SQRT_1over_DG_x_MF | DISTANCE_GROUP & MILES_FLOWN |   $\\sqrt{\\frac{1}{\\text{DISTANCE_GROUP} * \\text{MILES_FLOWN}}}$  |         
"


tabl3 %>% pander()
| Transformed Variable Name | Original Variable Name | Description |
|---------------------------|------------------------|-------------|
| lPASSENGERS | PASSENGERS | Log(PASSENGERS) |
| SQRT_1over_DG_x_MF | DISTANCE_GROUP & MILES_FLOWN | \(\sqrt{\frac{1}{\text{DISTANCE_GROUP} * \text{MILES_FLOWN}}}\) |
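As a small sketch of how the derived variables in the two tables above relate to the raw columns, the following builds them for two made-up itineraries. The data values are invented for illustration, and the natural log is assumed for lPASSENGERS since the report does not state the base.

```r
# Toy itineraries; all values are made up for illustration.
itin <- data.frame(
  ITIN_FARE      = c(200, 450),
  MILES_FLOWN    = c(400, 1600),
  DISTANCE_GROUP = c(1, 4),
  PASSENGERS     = c(1, 3)
)

# Yield: fare per mile flown, in dollars.
itin$ITIN_YIELD <- itin$ITIN_FARE / itin$MILES_FLOWN

# Log of passengers (natural log assumed).
itin$lPASSENGERS <- log(itin$PASSENGERS)

# Inverse square root of distance group times miles flown.
itin$SQRT_1over_DG_x_MF <- sqrt(1 / (itin$DISTANCE_GROUP * itin$MILES_FLOWN))

itin
```

Note that the transformed distance variable shrinks as an itinerary gets longer, so a positive coefficient on it corresponds to yield falling with distance.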

Graphical Summaries

Overview

Yield by log(Passengers)

ggplot(samp %>% drop_na()) +
  geom_smooth(aes(x= lPASSENGERS, y=ITIN_YIELD, col=ROUNDTRIP)) +
  #geom_jitter(aes(x= lPASSENGERS, y=ITIN_YIELD, col=ROUNDTRIP), alpha= 0.05) +
  facet_grid(rows= ~ITIN_GEO_TYPE)+
  theme_bw()+
  labs(col = "Flight Type", title= "Yield by log(Passengers)")+ 
  xlab("Log(Passengers)") + ylab("Fare per mile per passenger (Dollars)")

Yield by Distance Groups

ggplot(samp %>% drop_na()) +
  #geom_jitter(aes(x= DISTANCE_GROUP, y=ITIN_YIELD, col=ROUNDTRIP), alpha= 0.0075) +
  geom_smooth(aes(x= DISTANCE_GROUP, y=ITIN_YIELD, col=ROUNDTRIP)) +
  facet_grid(rows= ~ITIN_GEO_TYPE)+
  theme_bw()+
  theme(
      panel.spacing = unit(0.5, "lines")
    )+ 
  labs(col = "Flight Type", title= "Yield by Distance")+ 
  xlab("Distance in intervals of 500") + ylab("Fare per mile per passenger (Dollars)")

Yield by Binaries

ggplot(data = samp) +
    # Stacked histogram of yields, colored by trip type
    geom_histogram(aes(x = ITIN_YIELD, fill = ROUNDTRIP)) +
    theme_bw() +
    gghighlight(use_direct_label = FALSE) +
    # One panel per geography type
    facet_wrap(~ITIN_GEO_TYPE) +
    theme(
      panel.spacing = unit(0.5, "lines"),
      axis.ticks.x = element_blank()
    ) +
  labs(fill = "Flight Type", title = "Distribution of Yields by Flight Types") +
  xlab("Fare per mile per passenger (Dollars)") + ylab("Count")

Tabular Summaries

Overview

Yield ~ RoundTrip & Geo

pander(favstats(ITIN_YIELD ~  ROUNDTRIP + ITIN_GEO_TYPE, data=FullDat_Filt)[c("ROUNDTRIP.ITIN_GEO_TYPE", "Q1","median", "mean","Q3", "sd","n")], caption= "Summary table of Yields by Flight Type and Geography")
Summary table of Yields by Flight Type and Geography

| ROUNDTRIP.ITIN_GEO_TYPE | Q1 | median | mean | Q3 | sd | n |
|-------------------------|----|--------|------|----|----|---|
| One-Way.Continguous Domestic | 0.1047 | 0.1738 | 0.2354 | 0.2908 | 0.2091 | 4025439 |
| RoundTrip.Continguous Domestic | 0.1015 | 0.1599 | 0.2063 | 0.2571 | 0.1657 | 5803943 |
| One-Way.Non-Continguous Domestic | 0.0709 | 0.1014 | 0.1423 | 0.1586 | 0.148 | 338790 |
| RoundTrip.Non-Continguous Domestic | 0.0681 | 0.0942 | 0.1246 | 0.1337 | 0.1322 | 438735 |

Yield ~ Distance Group

pander(favstats(ITIN_YIELD ~ DISTANCE_GROUP, data=FullDat_Filt)[c(1:5, 12:16, 23:25),c("DISTANCE_GROUP", "Q1","median", "mean","Q3", "sd","n")], caption= "Summary table of Yields by Distance Group")
Summary table of Yields by Distance Group

| DISTANCE_GROUP | Q1 | median | mean | Q3 | sd | n |
|----------------|----|--------|------|----|----|---|
| 1 | 0.3347 | 0.5259 | 0.6113 | 0.8075 | 0.3752 | 420249 |
| 2 | 0.1809 | 0.2847 | 0.3348 | 0.4358 | 0.2166 | 1675504 |
| 3 | 0.1301 | 0.2045 | 0.2357 | 0.3043 | 0.1472 | 1941759 |
| 4 | 0.1071 | 0.1659 | 0.1874 | 0.2405 | 0.1128 | 1759498 |
| 5 | 0.0901 | 0.136 | 0.1561 | 0.1984 | 0.09458 | 1631197 |
| 12 | 0.0601 | 0.0841 | 0.09356 | 0.1158 | 0.048 | 74074 |
| 13 | 0.0595 | 0.0819 | 0.08936 | 0.1104 | 0.04411 | 30455 |
| 14 | 0.0613 | 0.0829 | 0.09038 | 0.1134 | 0.04188 | 26418 |
| 15 | 0.0572 | 0.079 | 0.08533 | 0.1071 | 0.0393 | 14473 |
| 16 | 0.0611 | 0.0798 | 0.08534 | 0.1036 | 0.03466 | 22320 |
| 23 | 0.0552 | 0.06725 | 0.07131 | 0.0857 | 0.02555 | 258 |
| 24 | 0.0568 | 0.0681 | 0.07164 | 0.08505 | 0.02991 | 131 |
| 25 | 0.0631 | 0.0869 | 0.07687 | 0.1041 | 0.03513 | 377 |

Data Conclusions

  • The variability of yield changes with passengers and distance, which will cause issues with our standard errors.
  • Increasing distance or passengers leads to decreased pricing and thus lower yields.
  • The distribution of yields differs somewhat by flight geography and trip type.

Methodology

Two regressions were created as we tried to better understand the data and the relationships between our variables. The first uses only simple transformations, such as logs, to help reduce heteroskedasticity. The second uses a more involved transformation (an inverse square root) to linearize the relationship between distance and our dependent variable, which was not linear in the first model.

Variable Overview

Our Variables

Variables

Full Pairs charts

panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor)
{
  # Save the plotting region and restore it when the panel is done
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(0, 1, 0, 1))
  # Absolute Pearson correlation, formatted to `digits` significant digits
  r <- abs(cor(x, y))
  txt <- format(c(r, 0.123456789), digits = digits)[1]
  txt <- paste(prefix, txt, sep = "")
  # Use the supplied text size if given; otherwise scale to the label width
  cex <- if (missing(cex.cor)) 0.8 / strwidth(txt) else cex.cor
  test <- cor.test(x, y)
  # borrowed from printCoefmat
  Signif <- symnum(test$p.value, corr = FALSE, na = FALSE,
                   cutpoints = c(0, 0.001, 0.01, 0.05, 0.1, 1),
                   symbols = c("***", "**", "*", ".", " "))
  text(0.5, 0.5, txt, cex = 1.5)
  text(.7, .8, Signif, cex = cex, col = 2)
}

pairs(samp, lower.panel=panel.smooth, upper.panel=panel.cor)

Standard Regression

Initial Regression Model

\[ \underbrace{Y_i}_\text{Itinerary Yield} \underbrace{=}_{\sim} \overbrace{\beta_0}^{\stackrel{\text{y-int}}{\text{Base Yield}}} + \overbrace{\beta_1}^{\stackrel{\text{slope along}}{\text{lPassenger}}} \underbrace{X_{1i}}_\text{lPassenger} + \overbrace{\beta_2}^{\stackrel{\text{slope along}}{\text{Distance Group}}} \underbrace{X_{2i}}_\text{Distance Group} + \overbrace{\beta_3}^{\stackrel{\text{change in}}{\text{y-int}}} \underbrace{X_{3i}}_\text{Roundtrip} + \overbrace{\beta_4}^{\stackrel{\text{change in}}{\text{y-int}}} \underbrace{X_{4i}}_\text{Non-Contiguous} +\overbrace{\beta_5}^{\stackrel{\text{change in}}{\text{slope}}} \underbrace{X_{1i}X_{2i}}_\text{lPassenger:Distance Group} + \epsilon_i \]

Results

lm1 <- lm(ITIN_YIELD ~ lPASSENGERS + DISTANCE_GROUP + ROUNDTRIP + ITIN_GEO_TYPE + lPASSENGERS:DISTANCE_GROUP , data= samp)

summary(lm1) %>% pander
|   | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|----------|------------|---------|------------|
| (Intercept) | 0.3562 | 0.002581 | 138 | 0 |
| lPASSENGERS | -0.03408 | 0.003598 | -9.474 | 2.993e-21 |
| DISTANCE_GROUP | -0.03444 | 0.0005024 | -68.57 | 0 |
| ROUNDTRIPRoundTrip | 0.04921 | 0.002572 | 19.13 | 7.417e-81 |
| ITIN_GEO_TYPENon-Continguous Domestic | 0.08026 | 0.004928 | 16.29 | 3.021e-59 |
| lPASSENGERS:DISTANCE_GROUP | -0.002455 | 0.000801 | -3.065 | 0.002179 |

Fitting linear model: ITIN_YIELD ~ lPASSENGERS + DISTANCE_GROUP + ROUNDTRIP + ITIN_GEO_TYPE + lPASSENGERS:DISTANCE_GROUP

| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|--------------|---------------------|---------|------------------|
| 20000 | 0.1604 | 0.2287 | 0.2285 |
lm1_r2 <- round(summary(lm1)$adj.r.squared, 2)
lm1_RSE <- round(sigma(lm1)*100, 1)
matrix_coef <- summary(lm1)$coefficients
my_estimates <- matrix_coef[ , 1] 
b0 <- round(my_estimates[1]*100, 2)
b1 <- round(my_estimates[2]*100, 2)
b2 <- round(my_estimates[3]*100, 2)
b3 <- round(my_estimates[4]*100, 2)
b4 <- round(my_estimates[5]*100, 2)
b5 <- round(my_estimates[6]*100, 2)  # scale to cents like b0-b4; unscaled, -0.002455 rounds to 0

Our initial regression model, estimated by ordinary least squares, has an \(R^2\) of 0.23, which is fairly substantial in the context of our data. Airline pricing varies enormously and depends on hundreds of possible factors; we have access to only a few of them and so can account for only a limited share of the total variation. Our residual standard error, however, is less than ideal: an error of 16 cents per mile is a large share of our total yield range ($0.05 to $2), roughly 8% of that range.

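Skipping the y-intercept, whose interpretation would make little practical sense here, the specific coefficient interpretations are as follows:

  • A 1% increase in itinerary passengers is associated with a decline in yield of about 0.034 cents per mile, before accounting for the interaction with distance. The coefficient of -3.41 cents applies to a one-unit increase in log passengers, so a 1% change scales it by roughly 1/100.

  • For every 500 additional miles on an itinerary, yield declines by 3.44 cents per mile for a single-passenger itinerary.

  • Roundtrip itineraries yield on average an additional 4.92 cents per mile.

  • Non-contiguous domestic itineraries yield on average 8.03 cents more per mile.

  • The interaction term shows that each additional distance group makes the passenger effect more negative by 0.25 cents per unit of log passengers, so the decline in yield associated with larger parties grows slightly with distance.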
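Because passengers enter the model in logs and interact with distance group, the effect of a percentage change in passengers depends on the distance group at which it is evaluated. The sketch below works this out from the printed lm1 coefficients. The helper `yield_change_cents` is a hypothetical name introduced here, and the natural log is assumed.

```r
# Coefficients copied from the lm1 output above (dollars per mile).
b_lpass <- -0.03408    # lPASSENGERS
b_inter <- -0.002455   # lPASSENGERS:DISTANCE_GROUP

# Change in yield (in cents per mile) from a given percent increase in
# passengers, evaluated at a chosen distance group. Natural log assumed.
yield_change_cents <- function(pct_increase, distance_group) {
  slope <- b_lpass + b_inter * distance_group   # slope along log(passengers)
  100 * slope * log(1 + pct_increase / 100)     # dollars -> cents
}

yield_change_cents(1, distance_group = 1)     # 1% more passengers, shortest flights
yield_change_cents(1, distance_group = 10)    # 1% more passengers, longer flights
yield_change_cents(100, distance_group = 1)   # doubling the number of passengers
```

A 1% change moves yield by only a few hundredths of a cent, while doubling the party size moves it by a few cents, which is the scale at which the lPASSENGERS coefficient is easiest to read.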

Assumptions

Because the data are not a time series, we limited our testing to heteroskedasticity and multicollinearity.

Below are the results from a Breusch-Pagan test:

bptest(lm1)
## 
##  studentized Breusch-Pagan test
## 
## data:  lm1
## BP = 670.81, df = 5, p-value < 2.2e-16

Despite the log transformation of passengers, significant heteroskedasticity is still present. This is likely because the variability of yield changes across the range of our regressors, and because of misspecification from omitting significant variables.
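To show what the studentized Breusch-Pagan statistic measures, here is a base-R sketch on simulated data (not the airline data). It regresses the squared OLS residuals on the regressors and uses n times the \(R^2\) of that auxiliary regression as the test statistic, compared against a chi-squared distribution. The data-generating process is an assumption chosen so that heteroskedasticity is present by construction.

```r
set.seed(1)
n <- 2000
x <- runif(n, 1, 10)
y <- 1 + 2 * x + rnorm(n, sd = x)   # error spread grows with x by construction

fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the original regressors.
aux <- lm(resid(fit)^2 ~ x)

bp_stat <- n * summary(aux)$r.squared          # studentized (Koenker) LM statistic
bp_df   <- length(coef(aux)) - 1               # number of regressors, intercept excluded
bp_pval <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

c(BP = bp_stat, df = bp_df, p.value = bp_pval)
```

A large statistic relative to its degrees of freedom, as in the output above, means the squared residuals are predictable from the regressors, which is exactly what constant error variance rules out.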

Because of concerns about high correlation between our variables, we tested for multicollinearity as well:

vif(lm1)
##                lPASSENGERS             DISTANCE_GROUP 
##                   4.006313                   1.664075 
##                  ROUNDTRIP              ITIN_GEO_TYPE 
##                   1.244330                   1.291236 
## lPASSENGERS:DISTANCE_GROUP 
##                   3.926802

As none of our variance inflation factors is greater than 10, we should not be worried about multicollinearity.
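For intuition about what a variance inflation factor is, the sketch below computes it by hand on simulated data: each regressor is regressed on the others, and the VIF is 1 / (1 - \(R^2\)) of that regression. The helper `vif_manual` and the simulated variables are hypothetical and only illustrate the definition; they are not the report's model.

```r
set.seed(2)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)  # deliberately correlated with x1
x3 <- rnorm(n)                       # unrelated to the other two

# VIF for one regressor: 1 / (1 - R^2) from regressing it on the others.
vif_manual <- function(target, others) {
  r2 <- summary(lm(target ~ ., data = others))$r.squared
  1 / (1 - r2)
}

vifs <- c(
  x1 = vif_manual(x1, data.frame(x2, x3)),
  x2 = vif_manual(x2, data.frame(x1, x3)),
  x3 = vif_manual(x3, data.frame(x1, x2))
)
round(vifs, 2)
```

A value near 1 means a regressor is nearly uncorrelated with the rest, and larger values mean its coefficient variance is inflated by overlap with other regressors. Models with interaction terms, like ours, tend to show moderately elevated values for the variables involved in the interaction.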

Robust Least Squares Model

Under heteroskedasticity, OLS coefficient estimates remain unbiased but are no longer BLUE, and the conventional standard errors are unreliable. We therefore recomputed the standard errors using a heteroskedasticity-consistent (HC1) covariance estimator. As shown below, the skeleton of the model and the coefficient estimates remain the same; only the standard errors, and with them the t-values, p-values, and confidence intervals, are recalculated.

\[ \underbrace{Y_i}_\text{Itinerary Yield} \underbrace{=}_{\sim} \overbrace{\beta_0}^{\stackrel{\text{y-int}}{\text{Base Yield}}} + \overbrace{\beta_1}^{\stackrel{\text{slope along}}{\text{lPassenger}}} \underbrace{X_{1i}}_\text{lPassenger} + \overbrace{\beta_2}^{\stackrel{\text{slope along}}{\text{Distance Group}}} \underbrace{X_{2i}}_\text{Distance Group} + \overbrace{\beta_3}^{\stackrel{\text{change in}}{\text{y-int}}} \underbrace{X_{3i}}_\text{Roundtrip} + \overbrace{\beta_4}^{\stackrel{\text{change in}}{\text{y-int}}} \underbrace{X_{4i}}_\text{Non-Contiguous} +\overbrace{\beta_5}^{\stackrel{\text{change in}}{\text{slope}}} \underbrace{X_{1i}X_{2i}}_\text{lPassenger:Distance Group} + \epsilon_i \]

Results

As the output below shows, the coefficient estimates are identical to those from OLS, so the direction and size of each relationship with yield are unchanged. What changes are the standard errors, which are larger for every coefficient, and with them the t-values and p-values.

Robust Standard errors:

coeftest(lm1, vcov = vcovHC(lm1, type= 'HC1'))
## 
## t test of coefficients:
## 
##                                          Estimate  Std. Error  t value
## (Intercept)                            0.35617275  0.00369938  96.2791
## lPASSENGERS                           -0.03408281  0.00462170  -7.3745
## DISTANCE_GROUP                        -0.03444474  0.00065497 -52.5900
## ROUNDTRIPRoundTrip                     0.04920797  0.00275301  17.8743
## ITIN_GEO_TYPENon-Continguous Domestic  0.08025764  0.00509773  15.7438
## lPASSENGERS:DISTANCE_GROUP            -0.00245513  0.00098655  -2.4886
##                                        Pr(>|t|)    
## (Intercept)                           < 2.2e-16 ***
## lPASSENGERS                           1.714e-13 ***
## DISTANCE_GROUP                        < 2.2e-16 ***
## ROUNDTRIPRoundTrip                    < 2.2e-16 ***
## ITIN_GEO_TYPENon-Continguous Domestic < 2.2e-16 ***
## lPASSENGERS:DISTANCE_GROUP              0.01283 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
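The HC1 covariance used above can be written out by hand, which makes clear that it changes only the standard errors and not the coefficients. The following base-R sketch uses simulated heteroskedastic data (an assumption for illustration, not the airline data) and the standard sandwich form with the n / (n - k) small-sample correction.

```r
set.seed(3)
n <- 1000
x <- runif(n, 1, 10)
y <- 1 + 2 * x + rnorm(n, sd = x)    # heteroskedastic errors by construction
fit <- lm(y ~ x)

X <- model.matrix(fit)               # n x k design matrix
e <- resid(fit)                      # OLS residuals
k <- ncol(X)

bread  <- solve(crossprod(X))        # (X'X)^{-1}
meat   <- crossprod(X * e)           # X' diag(e^2) X
vc_hc0 <- bread %*% meat %*% bread   # White's estimator
vc_hc1 <- (n / (n - k)) * vc_hc0     # HC1: degrees-of-freedom correction

se_ols <- sqrt(diag(vcov(fit)))
se_hc1 <- sqrt(diag(vc_hc1))
cbind(estimate = coef(fit), se_ols, se_hc1)
```

The `estimate` column is the ordinary OLS fit in both cases; the two standard-error columns differ because the robust version lets each observation contribute its own squared residual rather than assuming a common error variance.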

In addition to the robust standard errors, given how strongly the Breusch-Pagan test rejected homoskedasticity, we also calculated robust 95% confidence intervals for our coefficients to be doubly sure that they remained interpretable and useful. As shown, none of the intervals contains zero, so every estimate keeps its sign and is safe to include in the model.

Robust Coefficients at 95% confidence:

coefci(lm1, vcov = vcovHC(lm1, type= 'HC1'))
##                                              2.5 %        97.5 %
## (Intercept)                            0.348921659  0.3634238348
## lPASSENGERS                           -0.043141731 -0.0250238851
## DISTANCE_GROUP                        -0.035728532 -0.0331609488
## ROUNDTRIPRoundTrip                     0.043811842  0.0546040891
## ITIN_GEO_TYPENon-Continguous Domestic  0.070265660  0.0902496208
## lPASSENGERS:DISTANCE_GROUP            -0.004388852 -0.0005214142
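Each interval above is the estimate plus or minus a t critical value times the robust standard error. As a check, the sketch below rebuilds the lPASSENGERS interval from the numbers printed in the robust coefficient test, assuming residual degrees of freedom equal to the 20,000 observations minus the 6 estimated coefficients.

```r
# Values copied from the robust coeftest output above for lPASSENGERS.
est <- -0.03408281
se  <-  0.00462170
df_resid <- 20000 - 6    # observations minus estimated coefficients

# Two-sided 95% interval using the t distribution.
crit <- qt(0.975, df = df_resid)
ci <- c(lower = est - crit * se, upper = est + crit * se)
ci
```

The result matches the lPASSENGERS row of the interval table to the printed precision, and because both bounds are negative the sign of the passenger effect is not in doubt.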

Transformed Regression

In this transformed model, the nonlinear relationship between distance group, miles flown, and yield was transformed into a linear one (refer to the variable overview). The implications of this are expanded upon below.

Initial Regression Model

\[ \underbrace{Y_i}_\text{Itinerary Yield} \underbrace{=}_{\sim} \overbrace{\beta_0}^{\stackrel{\text{y-int}}{\text{Base Yield}}} + \overbrace{\beta_1}^{\stackrel{\text{slope along}}{\text{lPassenger}}} \underbrace{X_{1i}}_\text{lPassenger} + \overbrace{\beta_2}^{\stackrel{\text{slope along}}{\text{SQRT_1over_DG_x_MF}}} \underbrace{X_{2i}}_\text{SQRT_1over_DG_x_MF} + \overbrace{\beta_3}^{\stackrel{\text{change in}}{\text{y-int}}} \underbrace{X_{3i}}_\text{Roundtrip} + \overbrace{\beta_4}^{\stackrel{\text{change in}}{\text{y-int}}} \underbrace{X_{4i}}_\text{Non-Contiguous} +\overbrace{\beta_5}^{\stackrel{\text{change in}}{\text{slope}}} \underbrace{X_{1i}X_{2i}}_\text{lPassenger:SQRT_1over_DG_x_MF} + \epsilon_i \]

To keep the two regressions comparable, all variables were kept the same except that DISTANCE_GROUP was replaced with the new transformed variable.

Results

lm2 <- lm(ITIN_YIELD ~ lPASSENGERS + SQRT_1over_DG_x_MF + ROUNDTRIP + ITIN_GEO_TYPE + lPASSENGERS:SQRT_1over_DG_x_MF, data= samp)
summary(lm2) %>% pander
|   | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|----------|------------|---------|------------|
| (Intercept) | 0.001708 | 0.002621 | 0.6515 | 0.5148 |
| lPASSENGERS | -0.01332 | 0.002713 | -4.911 | 9.125e-07 |
| SQRT_1over_DG_x_MF | 12.81 | 0.1118 | 114.5 | 0 |
| ROUNDTRIPRoundTrip | 0.06455 | 0.002124 | 30.39 | 2.742e-198 |
| ITIN_GEO_TYPENon-Continguous Domestic | 0.006699 | 0.003779 | 1.773 | 0.07631 |
| lPASSENGERS:SQRT_1over_DG_x_MF | -2.016 | 0.1278 | -15.77 | 1.149e-55 |

Fitting linear model: ITIN_YIELD ~ lPASSENGERS + SQRT_1over_DG_x_MF + ROUNDTRIP + ITIN_GEO_TYPE + lPASSENGERS:SQRT_1over_DG_x_MF

| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|--------------|---------------------|---------|------------------|
| 20000 | 0.1366 | 0.4401 | 0.44 |
lm2_r2 <- round(summary(lm2)$adj.r.squared, 2)
lm2_RSE <- round(sigma(lm2)*100, 1)  # use lm2 here; the original referenced lm1 by mistake
matrix_coef <- summary(lm2)$coefficients
my_estimates <- matrix_coef[ , 1] 
b0 <- round(my_estimates[1]*100, 2)
b1 <- round(my_estimates[2]*100, 2)
b2 <- round(my_estimates[3], 2)
b3 <- round(my_estimates[4]*100, 2)
b4 <- round(my_estimates[5]*100, 2)
b5 <- round(my_estimates[6], 2)

Our transformed regression model, estimated by ordinary least squares, has an \(R^2\) of 0.44, nearly double that of the initial model and substantial in the context of our data. Airline pricing varies enormously and depends on hundreds of possible factors; we have access to only a few of them and so cannot account for all of the variation. Our residual standard error is still less than ideal: an error of 13.7 cents per mile is roughly 7% of our total yield range ($0.05 to $2). The primary drawback of this model is that we lose the ability to interpret a change in distance directly, because of the complexity of the transformation.

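Skipping the y-intercept, whose interpretation would make little practical sense here, the specific coefficient interpretations are as follows:

  • A 1% increase in itinerary passengers is associated with a decline in yield of about 0.013 cents per mile, before accounting for the interaction with distance. The coefficient of -1.33 cents applies to a one-unit increase in log passengers.

  • For every 1 unit increase in \(\text{(Miles Flown * Distance group)}^{-\frac{1}{2}}\) on a single-passenger itinerary we see a 12.81 dollar increase in yield. The transformed variable takes values well below 1, so realistic changes in distance move yield by a fraction of this amount.

  • Roundtrip itineraries yield on average an additional 6.46 cents per mile.

  • Non-contiguous domestic itineraries yield on average 0.67 cents more per mile, but under conventional standard errors this is significant only at the 10% level (p = 0.076).

  • The interaction term shows that each one-unit increase in log passengers lowers the coefficient on \(\text{(Miles Flown * Distance group)}^{-\frac{1}{2}}\) by 2.02 dollars, so the extra yield earned on short itineraries shrinks as the number of passengers grows.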
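Since the transformed distance coefficient is hard to read on its own, one way to interpret it is to compare the distance-related part of predicted yield for two concrete itineraries. The sketch below does this with the printed lm2 coefficients. The helpers `distance_group` and `distance_part` are hypothetical, and the mapping from miles to distance group (500-mile bins numbered from 1) is an assumption based on the variable description.

```r
# Coefficients copied from the lm2 output above.
b_dist  <- 12.81     # SQRT_1over_DG_x_MF
b_inter <- -2.016    # lPASSENGERS:SQRT_1over_DG_x_MF

# Assumed mapping from miles to distance group (500-mile bins starting at 1).
distance_group <- function(miles) floor(miles / 500) + 1

# Contribution of the distance terms to predicted yield (dollars per mile).
# Other terms (intercept, roundtrip, geography, lPASSENGERS main effect) are left out.
distance_part <- function(miles, passengers = 1) {
  s <- sqrt(1 / (distance_group(miles) * miles))
  (b_dist + b_inter * log(passengers)) * s
}

short_trip <- distance_part(400)    # distance group 1
long_trip  <- distance_part(1600)   # distance group 4
c(short = short_trip, long = long_trip, difference = short_trip - long_trip)
```

For a single passenger, the model attributes roughly 64 cents per mile to distance on the 400-mile itinerary and roughly 16 cents on the 1,600-mile itinerary, which is the kind of comparison the raw coefficient of 12.81 does not convey directly.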

Assumptions

Because the data are not a time series, we limited our testing to heteroskedasticity and multicollinearity.

bptest(lm2)
## 
##  studentized Breusch-Pagan test
## 
## data:  lm2
## BP = 2594.9, df = 5, p-value < 2.2e-16

Despite the log transformation of passengers and the attempt to linearize distance, significant heteroskedasticity is still present, in this case even more so than before (the Breusch-Pagan statistic rose from 670.81 to 2594.9). This is likely because the variability of yield changes across the range of our regressors, and because of misspecification from omitting significant variables.

Because of concerns about high correlation between our variables, we tested for multicollinearity as well:

vif(lm2)
##                    lPASSENGERS             SQRT_1over_DG_x_MF 
##                       3.138761                       1.529212 
##                      ROUNDTRIP                  ITIN_GEO_TYPE 
##                       1.169253                       1.046113 
## lPASSENGERS:SQRT_1over_DG_x_MF 
##                       3.614497

As none of our variance inflation factors is greater than 10, we should not be worried about multicollinearity.

Robust Least Squares Model

Results

Again, because of the issues found in our assumption checks, we calculated robust standard errors to use in place of the conventional OLS standard errors.

Robust Standard errors:

As the output below shows, the coefficient estimates are unchanged from OLS; only the standard errors, t-values, and p-values differ. The standard errors are larger for every coefficient except geography type, whose coefficient becomes significant at the 5% level under robust standard errors.

coeftest(lm2, vcov = vcovHC(lm2, type= 'HC1'))
## 
## t test of coefficients:
## 
##                                         Estimate Std. Error t value  Pr(>|t|)
## (Intercept)                            0.0017075  0.0039266  0.4349 0.6636747
## lPASSENGERS                           -0.0133249  0.0037126 -3.5891 0.0003326
## SQRT_1over_DG_x_MF                    12.8061241  0.2403688 53.2770 < 2.2e-16
## ROUNDTRIPRoundTrip                     0.0645521  0.0023613 27.3371 < 2.2e-16
## ITIN_GEO_TYPENon-Continguous Domestic  0.0066994  0.0030091  2.2264 0.0260013
## lPASSENGERS:SQRT_1over_DG_x_MF        -2.0156493  0.2506270 -8.0424 9.294e-16
##                                          
## (Intercept)                              
## lPASSENGERS                           ***
## SQRT_1over_DG_x_MF                    ***
## ROUNDTRIPRoundTrip                    ***
## ITIN_GEO_TYPENon-Continguous Domestic *  
## lPASSENGERS:SQRT_1over_DG_x_MF        ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Robust Coefficients at 95% confidence:

Again, in addition to the robust standard errors, given how strongly the Breusch-Pagan test rejected homoskedasticity, we also calculated robust 95% confidence intervals for our coefficients to be doubly sure that they remained interpretable and useful. As shown, every interval excludes zero except that of the intercept, so all slope estimates keep their signs and are safe to include in the model. The interval for geography type excludes zero only narrowly, so that estimate should be used with caution.

coefci(lm2, vcov = vcovHC(lm2, type= 'HC1'))
##                                              2.5 %       97.5 %
## (Intercept)                           -0.005989046  0.009404067
## lPASSENGERS                           -0.020601965 -0.006047849
## SQRT_1over_DG_x_MF                    12.334981337 13.277266854
## ROUNDTRIPRoundTrip                     0.059923640  0.069180473
## ITIN_GEO_TYPENon-Continguous Domestic  0.000801263  0.012597614
## lPASSENGERS:SQRT_1over_DG_x_MF        -2.506898922 -1.524399746

Regression plot

#Graph Resolution (more important for more complex shapes)
graph_reso <- 0.025

#Setup Axis
axis_x <- seq(min(samp$DISTANCE_GROUP), max(samp$DISTANCE_GROUP), by = graph_reso)
axis_y <- seq(min(samp$lPASSENGERS), max(samp$lPASSENGERS), by = graph_reso)
axis_col <- as.factor(c("One-Way", "RoundTrip"))
axis_f <- as.factor(c("Continguous Domestic", "Non-Continguous Domestic"))

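#Sample points
lmnew <- expand.grid(DISTANCE_GROUP = axis_x, lPASSENGERS = axis_y, ROUNDTRIP = axis_col, ITIN_GEO_TYPE = axis_f ,  KEEP.OUT.ATTRS=F)
lmnew$Z <- predict.lm(lm1, newdata = lmnew)
#Each (lPASSENGERS, DISTANCE_GROUP) cell holds four predictions (one per trip type and geography); average them so acast does not default to counting rows
lmnew <- acast(lmnew, lPASSENGERS ~ DISTANCE_GROUP , value.var = "Z", fun.aggregate = mean) #y ~ x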
samp %>% 
  filter(ITIN_GEO_TYPE == "Continguous Domestic") %>%
  plot_ly(., 
               x = ~DISTANCE_GROUP, 
               y = ~lPASSENGERS, 
               z = ~ITIN_YIELD, 
               #text = rownames(samp %>% drop_na()),
               type = "scatter3d",
               mode ="markers",
               color = ~as.factor(ROUNDTRIP),
               alpha= 0.7) %>%
              layout(title= list(text = "Continguous Domestic Flights (Lower 48)"))
samp %>% 
  filter(ITIN_GEO_TYPE == "Non-Continguous Domestic") %>%
  plot_ly(., 
               x = ~DISTANCE_GROUP, 
               y = ~lPASSENGERS, 
               z = ~ITIN_YIELD, 
               #text = rownames(samp %>% drop_na()),
               type = "scatter3d",
               mode ="markers",
               color = ~as.factor(ROUNDTRIP),
               alpha= 0.7) %>%
              layout(title= list(text = "Non-Continguous Domestic Flights (Outside Lower 48)"))

Conclusions and Avenues for Future Research

Increasing the number of passengers on an itinerary leads to lower yields, likely through the assumed discounts that come with bulk purchasing.

Increasing distance also leads to lower yields per mile as the flight gets longer. This is likely related to fixed costs as a percentage of total costs.

Lastly, when comparing one-way and round trips, we see that one-way itineraries provide greater yields on contiguous flights than on non-contiguous flights.


Literature Cited

Link

Link

Link

Link

Link